Language Independent Automatic Acquisition of Rigid Multiword Units from Unrestricted Text Corpora

Authors

  • Gaël Dias
  • Sylvie Guilloré
  • Gabriel Pereira Lopes
Abstract

Multiword units are groups of words that occur together more often than expected by chance in sub-languages. For example, Président de la République, Coupe du monde and Traité de Maastricht are multiword units. Unfortunately, most machine-readable dictionaries contain clearly insufficient information about multiword units. Their automatic extraction from corpora is therefore an important issue, not only for natural language processing but also for applications in Information Retrieval, Information Extraction and Machine Translation. In this paper, we propose a new extraction system based on a new association measure, the Mutual Expectation, and a new acquisition process based on an algorithm of local maxima, the LocalMax algorithm.
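
The abstract names a local-maxima acquisition process but does not define it here. The following is only a minimal sketch of the general idea, under stated assumptions: an n-gram is kept when its association score is at least as high as the scores of the shorter n-grams it contains and strictly higher than the scores of the longer n-grams that contain it. The function names (`ngrams`, `score`, `local_max`), the `min_count` frequency threshold, the restriction to contiguous n-grams, and the score itself are hypothetical stand-ins for illustration; the score is not the paper's Mutual Expectation.

```python
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def score(gram, counts, total):
    """Stand-in association score (an assumption, not the paper's measure).

    The n-gram's relative frequency is squared and divided by the mean
    relative frequency of its two contiguous (n-1)-gram parts.
    """
    p = Fraction(counts[gram], total)
    parts = [gram[:-1], gram[1:]]
    avg = Fraction(sum(counts[g] for g in parts), len(parts) * total)
    return p * p / avg


def local_max(tokens, max_n=4, min_count=2):
    """Select n-grams whose score is a local maximum among neighbours."""
    total = len(tokens)
    counts = Counter()
    # Count up to (max_n + 1)-grams so every candidate has its supersets.
    for n in range(1, max_n + 2):
        counts.update(ngrams(tokens, n))
    scores = {g: score(g, counts, total) for g in counts if len(g) >= 2}

    selected = []
    for g, s in scores.items():
        if len(g) > max_n or counts[g] < min_count:
            continue
        # Shorter neighbours: contiguous parts that themselves have a score.
        subs = [x for x in (g[:-1], g[1:]) if len(x) >= 2]
        # Longer neighbours: n-grams extending g by one word on either side.
        supers = [h for h in scores
                  if len(h) == len(g) + 1 and g in (h[:-1], h[1:])]
        if all(s >= scores[x] for x in subs) and all(s > scores[h] for h in supers):
            selected.append(g)
    return selected


tokens = "a coupe du monde b coupe du monde c coupe du monde d".split()
print(local_max(tokens))  # [('coupe', 'du', 'monde')]
```

In this toy input the bigrams "coupe du" and "du monde" are rejected because the trigram that contains them scores equally high, while the trigram is kept because every four-word extension scores lower. Exact fractions are used so that such ties are compared without floating-point error.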

Similar resources

Multilingual Aspects of Multiword Lexical Units

As most of the machine-readable dictionaries contain clearly insufficient information about multiword lexical units, there is a constant need to extend and tune specialized lexical databases to account for new expressions. In this paper, we present a system exclusively based on statistics that massively extracts from unrestricted text corpora contiguous and noncontiguous rigid multiword lexical...

A Generic Framework for Multiword Expressions Treatment: from Acquisition to Applications

This paper presents an open and flexible methodological framework for the automatic acquisition of multiword expressions (MWEs) from monolingual textual corpora. This research is motivated by the importance of MWEs for NLP applications. After briefly presenting the modules of the framework, the paper reports extrinsic evaluation results considering two applications: computer-aided lexicography ...

Time-Independent and Language-Independent Extraction of Multiword Expressions From Twitter

Multiword Expressions (MWEs) are crucial lexico-semantic units in any language. However, most work on MWEs has been focused on standard monolingual corpora. In this work, we examine MWE usage on Twitter, an inherently multilingual medium with an extremely short average text length that is often replete with grammatical errors. We present a new graph-based, language-agnostic method f...

Yet Another Ranking Function for Automatic Multiword Term Extraction

Term extraction is an essential task in domain knowledge acquisition. We propose two new measures to extract multiword terms from a domain-specific text. The first measure is both linguistically and statistically based. The second measure is graph-based, allowing assessment of the importance of a multiword term of a domain. Existing measures often solve some problems related (but not completely) to t...

Une plate-forme générique et ouverte pour le traitement des expressions polylexicales (An Open and Generic Framework for the Acquisition of Multiword Expressions) [in French]

In this paper, we present and evaluate an open and flexible methodological framework for the automatic acquisition of multiword expressions (MWEs) from monolingual textual corpora. We start with a practical motivation followed by a theoretical discussion of the behaviour and of the challenges that MWEs pose for NLP applic...

Publication date: 1999